Brain tumors are a significant health concern worldwide.
Identifying patterns and correlations among the variables in a brain tumor dataset can support early diagnosis, treatment planning, and better patient outcomes.
Despite advances in medical research, brain tumor datasets still need systematic analysis to show how patient, tumor, and treatment variables relate to outcomes. Such analysis improves our understanding of tumor behavior and informs clinical decision-making.
The primary objective of this project is to analyze the brain tumor dataset to identify key patterns and correlations between different variables.
Specifically, the project aims to:
- Describe the dataset
- Visualize key patterns
- Identify correlations
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(dplyr)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(knitr)
library(DT)
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
##
## The following object is masked from 'package:purrr':
##
## transpose
data <- read.csv("C:\\Users\\user\\Desktop\\MouseWithoutBorders\\BrainTumor.csv")
head(data,n=5)
## Patient.ID Age Gender Tumor.Type Tumor.Grade Tumor.Location
## 1 1 45 Male Glioblastoma IV Frontal lobe
## 2 2 55 Female Meningioma I Parietal lobe
## 3 3 60 Male Astrocytoma III Occipital lobe
## 4 4 50 Female Glioblastoma IV Temporal lobe
## 5 5 65 Male Astrocytoma II Frontal lobe
## Treatment Treatment.Outcome Time.to.Recurrence..months.
## 1 Surgery Partial response 10
## 2 Surgery Complete response NA
## 3 Surgery + Chemotherapy Progressive disease 14
## 4 Surgery + Radiation therapy Complete response NA
## 5 Surgery + Radiation therapy Partial response 24
## Recurrence.Site Survival.Time..months.
## 1 Temporal lobe 18
## 2 <NA> 36
## 3 Frontal lobe 22
## 4 <NA> 12
## 5 Frontal lobe 48
## Number of missing values in the dataset
sum(is.na(data))
## [1] 1124
## checking the columns with missing values
colSums(is.na(data))
## Patient.ID Age
## 0 0
## Gender Tumor.Type
## 0 0
## Tumor.Grade Tumor.Location
## 0 0
## Treatment Treatment.Outcome
## 0 0
## Time.to.Recurrence..months. Recurrence.Site
## 562 562
## Survival.Time..months.
## 0
numeric_columns = data %>%
select(where(is.numeric))
character_columns = data %>%
select(where(is.character))
## Handling missing values in numeric variables
data <- data %>%
mutate(across(where(is.numeric), ~ ifelse(is.na(.), -1, .)))
## Handling missing values in character variables
data <- data %>%
mutate(across(where(is.character), ~ ifelse(is.na(.), "unknown", .)))
colSums(is.na(data))
## Patient.ID Age
## 0 0
## Gender Tumor.Type
## 0 0
## Tumor.Grade Tumor.Location
## 0 0
## Treatment Treatment.Outcome
## 0 0
## Time.to.Recurrence..months. Recurrence.Site
## 0 0
## Survival.Time..months.
## 0
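Replacing missing recurrence times with -1 keeps the column free of NA, but the sentinel then enters any later mean or correlation. As a hedged alternative, the sketch below keeps NA in place and records recurrence in a separate logical column. The toy data frame copies a few values from the data preview above, and the column name `Recurred` is an assumption introduced only for illustration.

```r
library(dplyr)

# Toy rows copied from the data preview; not the full dataset.
toy <- data.frame(
  Patient.ID = 1:5,
  Time.to.Recurrence..months. = c(10, NA, 14, NA, 24),
  Survival.Time..months. = c(18, 36, 22, 12, 48)
)

# Keep NA for "no recurrence recorded" and add an explicit flag instead of -1.
# `Recurred` is a hypothetical column name, not part of the original data.
toy <- toy %>%
  mutate(Recurred = !is.na(Time.to.Recurrence..months.))

# The mean over observed recurrence times is not pulled down by a sentinel value.
mean_recurrence <- mean(toy$Time.to.Recurrence..months., na.rm = TRUE)
mean_recurrence
```

With this layout, summaries of recurrence time describe only the patients who actually had a recurrence, and the flag can be tabulated on its own.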
datatable(summary(data))
age_distro <- plot_ly(data, x=~Age, type="histogram", width=500, height=400)
age_distro <- age_distro %>%
layout(
title = "Age Distribution",
xaxis = list(
title = "Age"
),
yaxis = list(
title = "Frequency"
)
)
age_distro
The histogram of age distribution illustrates the frequency of various ages within the dataset.
Age Range: The data spans ages from approximately 30 to 70.
Peak Frequency: The highest frequency of individuals is around the age of 55, indicating that this age group is the most represented in the dataset.
Distribution Shape: The distribution shows a relatively normal (bell-curve) shape with a central peak and tapering frequencies at the extremes. This suggests a balanced distribution around middle age.
This age distribution insight can be useful in understanding the demographic characteristics of the dataset, aiding in more targeted analyses or interventions based on age-related factors.
gender_count <- table(data$Gender)
gender_count <- as.data.frame(gender_count)
colnames(gender_count) <- c("Gender", "Count")
gender_distro <- plot_ly(gender_count, x = ~Gender, y = ~Count, type = 'bar',width=500, height = 200) %>%
layout(
title = list(
text = "Gender Distribution",
font = list(size=15)),
xaxis = list(title = list(
text = "Gender",
font =list(size=10))),
yaxis = list(title =list(
text = "Count",
font = list(size=10))))
gender_count
## Gender Count
## 1 Female 1007
## 2 Male 993
gender_distro
Based on the gender distribution analysis of the dataset, I observe the following counts:
Female: 1007
Male: 993
The gender distribution is almost equal, with slightly more females than males. Of the 2,000 patients, females make up 50.35% and males 49.65%. Because the two groups are nearly the same size, comparisons between genders in this dataset are not distorted by unequal group sizes.
type_count <- table(data$Tumor.Type)
type_df <- as.data.frame(type_count)
colnames(type_df) <- c("Type","Count")
type_fig <- plot_ly(type_df, x=~Type,y=~Count, type = "bar",width = 500, height = 300) %>%
layout(
title = list(
text = "Tumor Type Distribution",
font = list(size=15)
),
xaxis = list(
title = "Tumor Type",
font = list(size=9),
tickfont = list(size = 8)
),
yaxis = list(
title = "Count",
font = list(size = 9)
)
)
type_df
## Type Count
## 1 Astrocytoma 653
## 2 Glioblastoma 637
## 3 Meningioma 710
type_fig
The dataset contains three tumor types:
Astrocytoma: 653 cases
Glioblastoma: 637 cases
Meningioma: 710 cases
Out of 2,000 cases in total:
Meningioma is the most common, with 710 cases, accounting for 35.50% of the total.
Astrocytoma follows, with 653 cases, making up 32.65%.
Glioblastoma is the least common, with 637 cases, representing 31.85%.
The three types are fairly evenly represented, with Meningioma slightly more frequent. These are frequencies within this dataset and need not reflect prevalence in the wider population.
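The shares of each tumor type can be computed directly from the counts rather than by hand. The following is a minimal sketch using the three counts from the table above; the object names are chosen for this example only.

```r
# Counts taken from the tumor type table above.
type_counts <- c(Astrocytoma = 653, Glioblastoma = 637, Meningioma = 710)

# prop.table() divides each count by the total; multiply by 100 for percentages.
type_share <- round(100 * prop.table(type_counts), 2)
type_share
```

Computing the percentages in code keeps them consistent with the counts if the data are updated.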
type_gender_count <- table(data$Tumor.Type, data$Gender)
type_gender_df <- as.data.frame(type_gender_count)
colnames(type_gender_df) <- c("Type", "Gender", "Count")
# Create Plotly grouped bar chart
type_gender_fig <- plot_ly(type_gender_df, x = ~Type, y = ~Count, color = ~Gender, type = "bar", width = 500, height = 300) %>%
layout(
title = list(
text = "Tumor Type Distribution by Gender",
font = list(size = 15)
),
xaxis = list(
title = "Tumor Type",
font = list(size = 9),
tickfont = list(size = 8)
),
yaxis = list(
title = "Count",
font = list(size = 9)
)
)
type_gender_df
## Type Gender Count
## 1 Astrocytoma Female 332
## 2 Glioblastoma Female 297
## 3 Meningioma Female 378
## 4 Astrocytoma Male 321
## 5 Glioblastoma Male 340
## 6 Meningioma Male 332
type_gender_fig
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Astrocytoma:
Females: 332 cases (50.8% of total Astrocytoma cases)
Males: 321 cases (49.2% of total Astrocytoma cases)
Insight: The distribution of Astrocytoma is nearly balanced between genders, with a slight predominance in females.
Glioblastoma:
Females: 297 cases (46.6% of total Glioblastoma cases)
Males: 340 cases (53.4% of total Glioblastoma cases)
Insight: Glioblastoma shows a slight predominance in males.
Meningioma:
Females: 378 cases (53.2% of total Meningioma cases)
Males: 332 cases (46.8% of total Meningioma cases)
Insight: Meningioma is more common in females compared to males.
Females: 1007 cases in total
Males: 993 cases in total
The gender distribution across the different types of conditions in the dataset shows:
Astrocytoma is fairly balanced between genders.
Glioblastoma is slightly more common in males.
Meningioma is more common in females.
These insights can help in understanding the gender-specific prevalence of these conditions, which may be crucial for targeted research, treatment planning, and resource allocation.
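The within-type gender percentages quoted above are row proportions of the type-by-gender table. A short sketch of that calculation, using the counts from the table above (object names are assumptions of this example):

```r
# Counts taken from the tumor type by gender table above.
# matrix() fills column by column: first the Female column, then the Male column.
type_gender <- matrix(
  c(332, 297, 378,   # Female: Astrocytoma, Glioblastoma, Meningioma
    321, 340, 332),  # Male:   Astrocytoma, Glioblastoma, Meningioma
  nrow = 3,
  dimnames = list(
    Type = c("Astrocytoma", "Glioblastoma", "Meningioma"),
    Gender = c("Female", "Male")
  )
)

# margin = 1 normalizes within each row, giving the gender split per tumor type.
gender_share_by_type <- round(100 * prop.table(type_gender, margin = 1), 1)
gender_share_by_type
```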
location_count <- table(data$Tumor.Location)
location_df <- as.data.frame(location_count)
colnames(location_df) <- c("Location","Count")
location_fig <- plot_ly(location_df, x=~Location,y=~Count, type = "bar",width = 500, height = 300) %>%
layout(
title = list(
text = "Tumor Location Distribution",
font = list(size=15)
),
xaxis = list(
title = "Tumor Location",
font = list(size=9),
tickfont = list(size = 8)
),
yaxis = list(
title = "Count",
font = list(size = 9)
)
)
location_df
## Location Count
## 1 Frontal lobe 515
## 2 Occipital lobe 485
## 3 Parietal lobe 503
## 4 Temporal lobe 497
location_fig
Frontal Lobe:
Count: 515 cases
Insight: The frontal lobe is the most common location for tumors in this dataset, accounting for 25.75% of all cases.
Occipital Lobe:
Count: 485 cases
Insight: The occipital lobe has the fewest cases among the four lobes, representing 24.25% of all cases.
Parietal Lobe:
Count: 503 cases
Insight: Tumors in the parietal lobe make up 25.15% of the dataset.
Temporal Lobe:
Count: 497 cases
Insight: The temporal lobe accounts for 24.85% of all cases.
The distribution of tumors across the four lobes is nearly uniform: the most common site (frontal lobe) and the least common (occipital lobe) differ by only 30 cases, or 1.5 percentage points.
These insights are crucial for understanding the anatomical distribution of tumors, which can inform medical research, diagnosis strategies, and treatment planning.
location_gender_count <- table(data$Tumor.Location, data$Gender)
location_gender_df <- as.data.frame(location_gender_count)
colnames(location_gender_df) <- c("Location", "Gender", "Count")
# Create Plotly grouped bar chart
location_gender_fig <- plot_ly(location_gender_df, x = ~Location, y = ~Count, color = ~Gender, type = "bar", width = 500, height = 300) %>%
layout(
title = list(
text = "Tumor Location Distribution by Gender",
font = list(size = 15)
),
xaxis = list(
title = "Tumor Location",
font = list(size = 9),
tickfont = list(size = 8)
),
yaxis = list(
title = "Count",
font = list(size = 9)
)
)
location_gender_df
## Location Gender Count
## 1 Frontal lobe Female 198
## 2 Occipital lobe Female 368
## 3 Parietal lobe Female 321
## 4 Temporal lobe Female 120
## 5 Frontal lobe Male 317
## 6 Occipital lobe Male 117
## 7 Parietal lobe Male 182
## 8 Temporal lobe Male 377
location_gender_fig
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Frontal Lobe:
Females: 198 cases (38.4% of total Frontal lobe cases)
Males: 317 cases (61.6% of total Frontal lobe cases)
Insight: The frontal lobe tumors are more prevalent in males compared to females.
Occipital Lobe:
Females: 368 cases (75.9% of total Occipital lobe cases)
Males: 117 cases (24.1% of total Occipital lobe cases)
Insight: The occipital lobe tumors are significantly more common in females.
Parietal Lobe:
Females: 321 cases (63.8% of total Parietal lobe cases)
Males: 182 cases (36.2% of total Parietal lobe cases)
Insight: The parietal lobe tumors are more common in females.
Temporal Lobe:
Females: 120 cases (24.1% of total Temporal lobe cases)
Males: 377 cases (75.9% of total Temporal lobe cases)
Insight: The temporal lobe tumors are more prevalent in males.
The distribution of tumors across different brain lobes and genders shows distinct patterns:
Frontal Lobe: More common in males.
Occipital Lobe: Significantly more common in females.
Parietal Lobe: More common in females.
Temporal Lobe: More common in males.
These gender-specific insights can help in understanding the anatomical and demographic characteristics of tumors, which is crucial for personalized medical research, targeted treatment strategies, and resource allocation.
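One way to check whether the location-by-gender pattern is larger than chance variation is a chi-squared test of independence. This is a sketch using the counts copied from the table above; the choice of test is an assumption of this example and not part of the original analysis.

```r
# Counts taken from the tumor location by gender table above.
# matrix() fills column by column: first the Female column, then the Male column.
location_gender <- matrix(
  c(198, 368, 321, 120,   # Female: Frontal, Occipital, Parietal, Temporal
    317, 117, 182, 377),  # Male:   Frontal, Occipital, Parietal, Temporal
  nrow = 4,
  dimnames = list(
    Location = c("Frontal lobe", "Occipital lobe", "Parietal lobe", "Temporal lobe"),
    Gender = c("Female", "Male")
  )
)

# Pearson chi-squared test of independence between location and gender.
test_result <- chisq.test(location_gender)
test_result$statistic
test_result$p.value
```

A small p-value here would indicate only that location and gender are associated in this dataset; it says nothing about the cause of the association.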
treatment_type_count <- table(data$Tumor.Type, data$Treatment)
treatment_type_df <- as.data.frame(treatment_type_count)
colnames(treatment_type_df) <- c("Type", "Treatment", "Count")
# Create Plotly grouped bar chart
treatment_type_fig <- plot_ly(treatment_type_df, x = ~Treatment, y = ~Count, color = ~Type, type = "bar", width = 700, height = 300) %>%
layout(
title = list(
text = "Tumor Treatment Distribution by Type",
font = list(size = 15)
),
xaxis = list(
title = "Tumor Treatment",
font = list(size = 9),
tickfont = list(size = 10)
),
yaxis = list(
title = "Count",
font = list(size = 9)
)
)
treatment_type_df
## Type Treatment Count
## 1 Astrocytoma Chemotherapy 52
## 2 Glioblastoma Chemotherapy 21
## 3 Meningioma Chemotherapy 50
## 4 Astrocytoma Chemotherapy + Radiation 0
## 5 Glioblastoma Chemotherapy + Radiation 2
## 6 Meningioma Chemotherapy + Radiation 0
## 7 Astrocytoma Radiation 21
## 8 Glioblastoma Radiation 7
## 9 Meningioma Radiation 45
## 10 Astrocytoma Surgery 45
## 11 Glioblastoma Surgery 15
## 12 Meningioma Surgery 79
## 13 Astrocytoma Surgery + Chemotherapy 223
## 14 Glioblastoma Surgery + Chemotherapy 341
## 15 Meningioma Surgery + Chemotherapy 215
## 16 Astrocytoma Surgery + Radiation 311
## 17 Glioblastoma Surgery + Radiation 250
## 18 Meningioma Surgery + Radiation 321
## 19 Astrocytoma Surgery + Radiation therapy 1
## 20 Glioblastoma Surgery + Radiation therapy 1
## 21 Meningioma Surgery + Radiation therapy 0
treatment_type_fig
Chemotherapy:
Astrocytoma: 52 cases
Glioblastoma: 21 cases
Meningioma: 50 cases
Insight: Chemotherapy alone is uncommon (123 cases in total) and is used less than half as often for Glioblastoma as for Astrocytoma or Meningioma.
Chemotherapy + Radiation:
Astrocytoma: 0 cases
Glioblastoma: 2 cases
Meningioma: 0 cases
Insight: The combination of chemotherapy and radiation without surgery is rare, appearing in only two Glioblastoma cases.
Radiation:
Astrocytoma: 21 cases
Glioblastoma: 7 cases
Meningioma: 45 cases
Insight: Radiation alone is most commonly used for Meningioma, followed by Astrocytoma and then Glioblastoma.
Surgery:
Astrocytoma: 45 cases
Glioblastoma: 15 cases
Meningioma: 79 cases
Insight: Surgery alone is recorded for all three types, most often for Meningioma and least often for Glioblastoma.
Surgery + Chemotherapy:
Astrocytoma: 223 cases
Glioblastoma: 341 cases
Meningioma: 215 cases
Insight: This is the most common treatment for Glioblastoma, covering 341 of its 637 cases (53.5%).
Surgery + Radiation:
Astrocytoma: 311 cases
Glioblastoma: 250 cases
Meningioma: 321 cases
Insight: This is the most common treatment overall (882 cases) and the most common choice for Astrocytoma and Meningioma.
Surgery + Radiation therapy:
Astrocytoma: 1 case
Glioblastoma: 1 case
Meningioma: 0 cases
Insight: These two records appear to be an inconsistent spelling of Surgery + Radiation rather than a separate treatment, and should be merged with that category during cleaning.
The distribution of treatments across tumor types shows:
Surgery-based combinations dominate: Surgery + Radiation and Surgery + Chemotherapy together account for 1,661 of the 2,000 cases (about 83%).
Single-modality treatments (Surgery, Chemotherapy, Radiation) are far less common, and each is used least for Glioblastoma.
Chemotherapy + Radiation without surgery is almost absent.
These counts describe which treatments are recorded most often for each tumor type in this dataset; they do not by themselves show which treatment is preferable.
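The treatment table lists both "Surgery + Radiation" and "Surgery + Radiation therapy", which look like two spellings of one category. Assuming they are meant to be the same treatment, the labels could be merged before tabulating, as in this sketch on a toy data frame (the rows are illustrative, not the original data):

```r
library(dplyr)

# Toy treatment labels for illustration only.
toy <- data.frame(
  Treatment = c("Surgery", "Surgery + Radiation therapy",
                "Surgery + Radiation", "Surgery + Radiation therapy"),
  stringsAsFactors = FALSE
)

# Map the variant spelling onto the main label; all other labels are unchanged.
toy <- toy %>%
  mutate(Treatment = if_else(Treatment == "Surgery + Radiation therapy",
                             "Surgery + Radiation",
                             Treatment))

table(toy$Treatment)
```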
treatment_outcome_count <- table(data$Treatment.Outcome, data$Treatment)
treatment_outcome_df <- as.data.frame(treatment_outcome_count)
colnames(treatment_outcome_df) <- c("Outcome", "Treatment", "Count")
# Create Plotly grouped bar chart
treatment_outcome_fig <- plot_ly(treatment_outcome_df, x = ~Treatment, y = ~Count, color = ~Outcome, type = "bar", width = 700, height = 300) %>%
layout(
title = list(
text = "Tumor Treatment Distribution by Outcome",
font = list(size = 15)
),
xaxis = list(
title = "Tumor Treatment",
font = list(size = 9),
tickfont = list(size = 10)
),
yaxis = list(
title = "Count",
font = list(size = 9)
)
)
treatment_outcome_df
## Outcome Treatment Count
## 1 Complete response Chemotherapy 31
## 2 Partial response Chemotherapy 8
## 3 Progressive disease Chemotherapy 30
## 4 Stable disease Chemotherapy 54
## 5 Complete response Chemotherapy + Radiation 0
## 6 Partial response Chemotherapy + Radiation 0
## 7 Progressive disease Chemotherapy + Radiation 2
## 8 Stable disease Chemotherapy + Radiation 0
## 9 Complete response Radiation 9
## 10 Partial response Radiation 4
## 11 Progressive disease Radiation 32
## 12 Stable disease Radiation 28
## 13 Complete response Surgery 78
## 14 Partial response Surgery 14
## 15 Progressive disease Surgery 39
## 16 Stable disease Surgery 8
## 17 Complete response Surgery + Chemotherapy 201
## 18 Partial response Surgery + Chemotherapy 167
## 19 Progressive disease Surgery + Chemotherapy 123
## 20 Stable disease Surgery + Chemotherapy 288
## 21 Complete response Surgery + Radiation 241
## 22 Partial response Surgery + Radiation 159
## 23 Progressive disease Surgery + Radiation 356
## 24 Stable disease Surgery + Radiation 126
## 25 Complete response Surgery + Radiation therapy 1
## 26 Partial response Surgery + Radiation therapy 1
## 27 Progressive disease Surgery + Radiation therapy 0
## 28 Stable disease Surgery + Radiation therapy 0
treatment_outcome_fig
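Chemotherapy:
Complete response: 31 cases
Partial response: 8 cases
Progressive disease: 30 cases
Stable disease: 54 cases
Insight: Stable disease is the most frequent outcome of chemotherapy alone (54 of 123 cases, 43.9%), while complete response and progressive disease are almost equal.
Chemotherapy + Radiation:
Complete response: 0 cases
Partial response: 0 cases
Progressive disease: 2 cases
Stable disease: 0 cases
Insight: Only two cases received this combination, both with progressive disease, which is too few to support any conclusion.
Radiation:
Complete response: 9 cases
Partial response: 4 cases
Progressive disease: 32 cases
Stable disease: 28 cases
Insight: Among treatments with more than two cases, radiation alone has the lowest complete response rate (9 of 73 cases, 12.3%); progressive disease is its most frequent outcome (43.8%).
Surgery:
Complete response: 78 cases
Partial response: 14 cases
Progressive disease: 39 cases
Stable disease: 8 cases
Insight: Surgery alone has the highest complete response rate (78 of 139 cases, 56.1%).
Surgery + Chemotherapy:
Complete response: 201 cases
Partial response: 167 cases
Progressive disease: 123 cases
Stable disease: 288 cases
Insight: Stable disease is the most frequent outcome (288 of 779 cases, 37.0%) and progressive disease the least frequent (15.8%).
Surgery + Radiation:
Complete response: 241 cases
Partial response: 159 cases
Progressive disease: 356 cases
Stable disease: 126 cases
Insight: Progressive disease is the most frequent outcome (356 of 882 cases, 40.4%), with complete response in 27.3% of cases.
The distribution of treatment outcomes shows:
Surgery alone has the highest share of complete responses, and radiation alone the lowest.
The two dominant treatments differ in profile: Surgery + Chemotherapy most often ends in stable disease, Surgery + Radiation in progressive disease.
Chemotherapy + Radiation and the two records labeled Surgery + Radiation therapy have too few cases to interpret.
These are unadjusted counts. Treatment choice depends on tumor type and grade, so differences in outcome between treatments cannot be read directly as differences in effectiveness.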
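Because the treatments differ greatly in size, outcomes are easier to compare as shares within each treatment than as raw counts. A minimal sketch of that normalization, using the counts from the outcome table above for the five treatments with more than two cases (object names are assumptions of this example):

```r
# Counts taken from the treatment outcome table above.
# matrix() fills column by column, so each line below is one treatment.
outcome_by_treatment <- matrix(
  c( 31,   8,  30,  54,   # Chemotherapy
      9,   4,  32,  28,   # Radiation
     78,  14,  39,   8,   # Surgery
    201, 167, 123, 288,   # Surgery + Chemotherapy
    241, 159, 356, 126),  # Surgery + Radiation
  nrow = 4,
  dimnames = list(
    Outcome = c("Complete response", "Partial response",
                "Progressive disease", "Stable disease"),
    Treatment = c("Chemotherapy", "Radiation", "Surgery",
                  "Surgery + Chemotherapy", "Surgery + Radiation")
  )
)

# margin = 2 normalizes within each column, giving outcome shares per treatment.
outcome_share <- round(100 * prop.table(outcome_by_treatment, margin = 2), 1)
outcome_share
```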
average_survival_treatment <- data %>%
group_by(Treatment) %>%
summarize(average_survival = round(mean(data$Survival.Time..months.,na.rm=TRUE)))
average_survival_treatment
## # A tibble: 7 × 2
## Treatment average_survival
## <chr> <dbl>
## 1 Chemotherapy 34
## 2 Chemotherapy + Radiation 34
## 3 Radiation 34
## 4 Surgery 34
## 5 Surgery + Chemotherapy 34
## 6 Surgery + Radiation 34
## 7 Surgery + Radiation therapy 34
survival_average_df <- as.data.frame(average_survival_treatment)
colnames(survival_average_df) <- c("Treatment", "Average Survival Month")
survival_average_fig <- plot_ly(survival_average_df, x = ~Treatment, y = ~`Average Survival Month`,type = "bar", width = 700, height = 300) %>%
layout(
title = list(
text = "Average Survival Months by Treatment",
font = list(size = 15)
),
xaxis = list(
title = "Tumor Treatment",
font = list(size = 9),
tickfont = list(size = 10)
),
yaxis = list(
title = "Average Survival Months",
font = list(size = 9)
)
)
survival_average_df
## Treatment Average Survival Month
## 1 Chemotherapy 34
## 2 Chemotherapy + Radiation 34
## 3 Radiation 34
## 4 Surgery 34
## 5 Surgery + Chemotherapy 34
## 6 Surgery + Radiation 34
## 7 Surgery + Radiation therapy 34
survival_average_fig
Every treatment shows the same average survival of 34 months. This is an artifact of the code rather than a finding: inside summarize(), the expression data$Survival.Time..months. refers to the full column of the ungrouped data frame, so each group receives the overall mean.
Referencing the column by its bare name, Survival.Time..months., is needed to obtain per-treatment averages. Until that is done, no conclusion about differences in survival between treatments can be drawn from this table.
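The difference between a grouped and an ungrouped mean inside summarize() can be seen on a small example. The sketch below uses a toy data frame whose values are invented for illustration and are not from the original dataset.

```r
library(dplyr)

# Toy data for illustration only; values are not from the original dataset.
toy <- data.frame(
  Treatment = c("Surgery", "Surgery", "Radiation", "Radiation"),
  Survival.Time..months. = c(40, 50, 10, 20)
)

# toy$... bypasses the grouping, so every group gets the overall mean.
overall <- toy %>%
  group_by(Treatment) %>%
  summarize(average_survival = mean(toy$Survival.Time..months., na.rm = TRUE))

# The bare column name is evaluated within each group.
per_group <- toy %>%
  group_by(Treatment) %>%
  summarize(average_survival = mean(Survival.Time..months., na.rm = TRUE))

overall
per_group
```

In the toy data, `overall` repeats a single value for both treatments, while `per_group` gives a separate mean for each.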
correlation_matrix <- cor(numeric_columns)
correlation_matrix
## Patient.ID Age
## Patient.ID 1.000000000 -0.009721277
## Age -0.009721277 1.000000000
## Time.to.Recurrence..months. NA NA
## Survival.Time..months. -0.016479283 0.066281516
## Time.to.Recurrence..months. Survival.Time..months.
## Patient.ID NA -0.01647928
## Age NA 0.06628152
## Time.to.Recurrence..months. 1 NA
## Survival.Time..months. NA 1.00000000
Based on the correlation matrix:
Patient.ID: Patient ID is an arbitrary identifier, so its correlations with Age (approximately -0.0097) and Survival Time (approximately -0.0165) carry no clinical meaning. They only confirm that record order is unrelated to the other variables.
Age and Survival Time (months): There is a very weak positive correlation (approximately 0.0663). Age accounts for less than 0.5% of the variance in survival time (0.0663 squared is about 0.0044), so there is effectively no linear relationship between the two in this dataset.
Time to Recurrence (months): The coefficient is NA for every pairing with this column. The cause is missing data, not a lack of variability: numeric_columns was extracted before the missing values were replaced, the column has 562 missing values, and cor() returns NA by default for any pair of columns in which a value is missing.
Overall, none of the numeric variables shows a meaningful linear correlation with survival time, and the relationship between time to recurrence and survival time has not yet been measured.
Further analysis or additional variables may be needed to better understand the factors influencing survival times or recurrence rates in the context of brain tumor patients.
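When a column contains missing values, cor() accepts a `use` argument that controls how they are handled. The sketch below contrasts the default with pairwise deletion on a toy data frame; the values are invented for illustration and are not from the original dataset.

```r
# Toy data for illustration only; NA marks patients without a recorded recurrence.
toy <- data.frame(
  Age = c(45, 55, 60, 50, 65, 58),
  Time.to.Recurrence..months. = c(10, NA, 14, NA, 24, 18),
  Survival.Time..months. = c(18, 36, 22, 12, 48, 30)
)

# Default behaviour: a missing value in a column makes its correlations NA.
cor_default <- cor(toy)

# Pairwise deletion: each coefficient uses the rows where both variables are observed.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
cor_pairwise
```

With pairwise deletion, coefficients involving time to recurrence describe only the patients who had a recurrence, which should be stated when reporting them.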